{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "Tce3stUlHN0L"
},
"source": [
"##### Copyright 2024 Google LLC."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "tuOe1ymfHZPu"
},
"outputs": [],
"source": [
"# @title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "WYDXtkholMVS"
},
"source": [
"#### This notebook is created by [Nitin Tiwari](https://linkedin.com/in/tiwari-nitin).\n",
"\n",
"#### **Social links:**\n",
"* [LinkedIn](https://linkedin.com/in/tiwari-nitin)\n",
"* [GitHub](https://github.com/NSTiwari)\n",
"* [Twitter](https://x.com/NSTiwari21)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2--uLhHDlPPJ"
},
"source": [
"# Zero-shot Object Detection in videos"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VPLU1zrDlSDJ"
},
"source": [
"This notebook guides you to perform zero-shot object detection on videos using [PaliGemma](https://ai.google.dev/gemma/docs/paligemma) and draw the inferences using OpenCV and PIL.\n",
"\n",
"<table align=\"left\">\n",
" <td>\n",
" <a target=\"_blank\" href=\"https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/PaliGemma/[PaliGemma_1]Zero_shot_object_detection_in_videos.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
" </td>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cq2882eIlczp"
},
"source": [
"### Get access to PaliGemma\n",
"\n",
"Before using PaliGemma for the first time, you must request access to the model through Hugging Face by completing the following steps:\n",
"\n",
"1. Log in to [Hugging Face](https://huggingface.co), or create a new Hugging Face account if you don't already have one.\n",
"2. Go to the [PaliGemma model card](https://huggingface.co/google/paligemma-3b-mix-224) to get access to the model.\n",
"3. Complete the consent form and accept the terms and conditions.\n",
"\n",
"To generate a Hugging Face token, open your [**Settings** page in Hugging Face](https://huggingface.co/settings), choose **Access Tokens** option in the left pane and click **New token**. In the next window that appears, give a name to your token and choose the type as **Write** to get the write access.\n",
"\n",
"Then, in Colab, select **Secrets** (🔑) in the left pane and add your Hugging Face token. Store your Hugging Face token under the name `HF_TOKEN`."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qV1XyFxHlfGB"
},
"source": [
"### Select the runtime\n",
"\n",
"To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the PaliGemma model. In this case, you can use a T4 GPU:\n",
"\n",
"1. In the upper-right of the Colab window, click the **▾ (Additional connection options)** dropdown menu.\n",
"1. Select **Change runtime type**.\n",
"1. Under **Hardware accelerator**, select **T4 GPU**."
]
},
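{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can quickly confirm that the GPU runtime is active before continuing. The check below is a minimal sketch that relies on the `torch` package preinstalled on standard Colab runtimes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity check: confirm a CUDA-capable GPU is visible to PyTorch.\n",
"import torch\n",
"\n",
"print(\"CUDA available:\", torch.cuda.is_available())\n",
"if torch.cuda.is_available():\n",
"    print(\"GPU:\", torch.cuda.get_device_name(0))"
]
},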
{
"cell_type": "markdown",
"metadata": {
"id": "4Y6WdnIIEpOh"
},
"source": [
"### Step 1: Install libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "l5so74dCEO5B"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m119.8/119.8 MB\u001b[0m \u001b[31m9.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m309.4/309.4 kB\u001b[0m \u001b[31m26.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m251.6/251.6 kB\u001b[0m \u001b[31m23.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m21.3/21.3 MB\u001b[0m \u001b[31m66.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25h"
]
}
],
"source": [
"!pip install bitsandbytes transformers accelerate peft -q"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "y4zYJcmnlv3Z"
},
"source": [
"### Step 2: Set environment variables for Hugging Face token"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "ggzRPV54lxnY"
},
"outputs": [],
"source": [
"import os\n",
"from google.colab import userdata\n",
"\n",
"os.environ[\"HF_TOKEN\"] = userdata.get('HF_TOKEN')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "IN0xf7VyEy-I"
},
"source": [
"### Step 3: Load pre-trained PaliGemma base model"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"id": "62GzNxTdE1hL"
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "9915dd0a1d88428f84e7783a4127fc31",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"config.json: 0%| | 0.00/1.03k [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "6e9ad8d9c9bd434c8211cb9efbd8c7c8",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"model.safetensors.index.json: 0%| | 0.00/62.6k [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "51cc44c7d0cb4640808c10da3e356673",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Downloading shards: 0%| | 0/3 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "36105f73ab144535bdb23456a10dbf9e",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"model-00001-of-00003.safetensors: 0%| | 0.00/4.95G [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "73e9a813057349939e31717e87e66015",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"model-00002-of-00003.safetensors: 0%| | 0.00/5.00G [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "fa0e68811636489a82cca3d1fbd4010c",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"model-00003-of-00003.safetensors: 0%| | 0.00/1.74G [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.\n",
"Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use\n",
"`config.hidden_activation` if you want to override this behaviour.\n",
"See https://github.com/huggingface/transformers/pull/29402 for more details.\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "d25921ffadb341708a6a191ca891951f",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "eb4e552bd2884759a69836c37af2a015",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"generation_config.json: 0%| | 0.00/137 [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "72c64d188e99404b953ef33339c6bbf7",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"preprocessor_config.json: 0%| | 0.00/699 [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "1786e7824bb8465c83ec08304bd0534f",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"tokenizer_config.json: 0%| | 0.00/40.0k [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "f56d4ee1d4a1400ea495cf8a8ea412ef",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"tokenizer.model: 0%| | 0.00/4.26M [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "c90a8f2524ff47e69c6a53371eee2544",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"tokenizer.json: 0%| | 0.00/17.5M [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "a4893aef62004152bea2e53d59e951d6",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"added_tokens.json: 0%| | 0.00/24.0 [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "27f0bca5215e48bfba527d3ab3236521",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"special_tokens_map.json: 0%| | 0.00/607 [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from transformers import AutoTokenizer, PaliGemmaForConditionalGeneration, PaliGemmaProcessor\n",
"import torch\n",
"\n",
"device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
"model_id = \"google/paligemma-3b-mix-224\"\n",
"model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16)\n",
"processor = PaliGemmaProcessor.from_pretrained(model_id)"
]
},
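{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the model does not fit comfortably in GPU memory, you can optionally load it with 4-bit quantization instead of `bfloat16`. The sketch below uses the `bitsandbytes` package installed in Step 1; the actual load call is left commented out so you only run one of the two loading paths."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: 4-bit quantized loading (an alternative to the bfloat16 load above).\n",
"from transformers import BitsAndBytesConfig\n",
"\n",
"bnb_config = BitsAndBytesConfig(\n",
"    load_in_4bit=True,\n",
"    bnb_4bit_compute_dtype=torch.bfloat16,\n",
")\n",
"\n",
"# Uncomment to load the quantized model instead of the bfloat16 one:\n",
"# model = PaliGemmaForConditionalGeneration.from_pretrained(\n",
"#     model_id, quantization_config=bnb_config, device_map=\"auto\"\n",
"# )"
]
},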
{
"cell_type": "markdown",
"metadata": {
"id": "Rm8_--4kE9Px"
},
"source": [
"### Step 4: Function to draw inference on videos"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"id": "sFFBnK06FvSK"
},
"outputs": [],
"source": [
"from PIL import Image, ImageDraw, ImageFont\n",
"import cv2\n",
"import numpy as np\n",
"\n",
"def draw_bounding_box(image, coordinates, label, width, height):\n",
" global label_colors\n",
" y1, x1, y2, x2 = coordinates\n",
" y1, x1, y2, x2 = map(round, (y1*height, x1*width, y2*height, x2*width))\n",
"\n",
" text_size, _ = cv2.getTextSize(label, cv2.FONT_HERSHEY_SIMPLEX, 1, 3)\n",
" text_width, text_height = text_size\n",
"\n",
" text_x = x1 + 2\n",
" text_y = y1 - 5\n",
"\n",
" font_scale = 1\n",
" label_rect_width = text_width + 8\n",
" label_rect_height = int(text_height * font_scale)\n",
"\n",
" color = label_colors.get(label, None)\n",
" if color is None:\n",
" color = np.random.randint(0, 256, (3,)).tolist()\n",
" label_colors[label] = color\n",
"\n",
" cv2.rectangle(image, (x1, y1 - label_rect_height), (x1 + label_rect_width, y1), color, -1)\n",
"\n",
" thickness = 2\n",
" cv2.putText(image, label, (text_x, text_y), cv2.FONT_HERSHEY_SIMPLEX, font_scale, (255, 255, 255), thickness, cv2.LINE_AA)\n",
"\n",
" cv2.rectangle(image, (x1, y1), (x2, y2), color, 2)\n",
" return image"
]
},
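{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before processing a full video, you can sanity-check the helper on a synthetic frame. The string below is a hypothetical example of PaliGemma's detection output format: four `<locYYYY>` location tokens (integers from 0 to 1023, in `y1 x1 y2 x2` order) followed by a label, which the detection loop in Step 6 parses and normalizes by 1024."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick test of draw_bounding_box using a hypothetical PaliGemma-style detection string.\n",
"label_colors = {}\n",
"\n",
"sample = \"<loc0256><loc0256><loc0768><loc0768> person\"\n",
"tokens = sample.replace(\"<loc\", \"\").split()\n",
"coords = [int(c) / 1024 for c in tokens[0].split(\">\") if c != \"\"]\n",
"label = tokens[1]\n",
"\n",
"test_frame = np.zeros((480, 640, 3), dtype=np.uint8)\n",
"test_frame = draw_bounding_box(test_frame, coords, label, 640, 480)\n",
"print(\"Parsed (y1, x1, y2, x2):\", coords, \"label:\", label)"
]
},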
{
"cell_type": "markdown",
"metadata": {
"id": "ozB-dLGlmCoK"
},
"source": [
"### Step 5: Configure the input video and text prompt"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"id": "LZQ1pGAWmNEJ"
},
"outputs": [],
"source": [
"input_video = 'input_video.mp4' # @param {type:\"string\"}\n",
"\n",
"prompt = \"detect person, phone, bottle\" # @param {type: \"string\"}\n",
"\n",
"output_file = 'output_video.avi' # @param {type: \"string\"}"
]
},
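{
"cell_type": "markdown",
"metadata": {},
"source": [
"If `input_video.mp4` is not already on the Colab disk, you can upload one from your machine. This is a minimal sketch using Colab's file-upload helper; it assumes the uploaded file name matches `input_video` (or that you update `input_video` accordingly)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: upload a video from your machine to the Colab working directory.\n",
"from google.colab import files\n",
"\n",
"# Uncomment to open the file picker; the uploaded file keeps its original name.\n",
"# uploaded = files.upload()"
]
},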
{
"cell_type": "markdown",
"metadata": {
"id": "bQpFZOBTmgNi"
},
"source": [
"### Step 6: Pass the input video and text prompt to PaliGemma and draw inferences"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "7Z6SHLs2E_dD"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Output video output_video.avi saved to disk.\n"
]
}
],
"source": [
"# Open the input video file.\n",
"cap = cv2.VideoCapture(input_video)\n",
"\n",
"fourcc = cv2.VideoWriter_fourcc(*'XVID')\n",
"out = cv2.VideoWriter(output_file, fourcc, 20.0, (int(cap.get(3)), int(cap.get(4))))\n",
"\n",
"label_colors = {}\n",
"\n",
"while(True):\n",
" ret, frame = cap.read()\n",
" if not ret:\n",
" break\n",
"\n",
" # Convert the frame to a PIL Image.\n",
" img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))\n",
"\n",
" # Send text prompt and image as input.\n",
" inputs = processor(text=prompt, images=img,\n",
" padding=\"longest\", do_convert_rgb=True, return_tensors=\"pt\").to(\"cuda\")\n",
" model.to(device)\n",
" inputs = inputs.to(dtype=model.dtype)\n",
"\n",
" # Get output.\n",
" with torch.no_grad():\n",
" output = model.generate(**inputs, max_length=496)\n",
"\n",
" paligemma_response = processor.decode(output[0], skip_special_tokens=True)[len(prompt):].lstrip(\"\\n\")\n",
" detections = paligemma_response.split(\" ; \")\n",
"\n",
" # Parse the output bounding box coordinates\n",
" parsed_coordinates = []\n",
" labels = []\n",
"\n",
" for item in detections:\n",
" detection = item.replace(\"<loc\", \"\").split()\n",
"\n",
" if len(detection) >= 2:\n",
" coordinates_str = detection[0].replace(\",\", \"\")\n",
" label = detection[1]\n",
" if \"<seg\" in label:\n",
" continue\n",
" else:\n",
" labels.append(label)\n",
" else:\n",
" # No label detected, skip the iteration.\n",
" continue\n",
"\n",
" coordinates = coordinates_str.split(\">\")\n",
" coordinates = coordinates[:4]\n",
"\n",
" if coordinates[-1] == '':\n",
" coordinates = coordinates[:-1]\n",
"\n",
"\n",
" coordinates = [int(coord)/1024 for coord in coordinates]\n",
" parsed_coordinates.append(coordinates)\n",
"\n",
" width = img.size[0]\n",
" height = img.size[1]\n",
"\n",
" # Draw bounding boxes on the frame\n",
" image = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)\n",
" for coordinates, label in zip(parsed_coordinates, labels):\n",
" output_frame = draw_bounding_box(frame, coordinates, label, width, height)\n",
"\n",
" # Write the frame to the output video\n",
" out.write(output_frame)\n",
"\n",
" # Exit on pressing 'q'\n",
" if cv2.waitKey(1) & 0xFF == ord('q'):\n",
" break\n",
"\n",
"# Release the video capture, output video writer, and destroy the window\n",
"cap.release()\n",
"out.release()\n",
"cv2.destroyAllWindows()\n",
"print(\"Output video \" + output_file + \" saved to disk.\")"
]
}
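,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 7: Download the output video (optional)\n",
"\n",
"To view the result locally, you can download the rendered video from the Colab runtime. This is a minimal sketch using Colab's file-download helper with the `output_file` path configured in Step 5."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Download the rendered video from the Colab runtime to your machine.\n",
"from google.colab import files\n",
"\n",
"files.download(output_file)"
]
}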
],
"metadata": {
"accelerator": "GPU",
"colab": {
"name": "[PaliGemma_1]Zero_shot_object_detection_in_videos.ipynb",
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}